ABSTRACT

How does information flow in online social networks? How 
does the structure and size of the information cascade evolve 
in time? How can we efficiently mine the information con- 
tained in cascade dynamics? We approach these questions 
empirically and present an efficient and scalable mathemat- 
ical framework for quantitative analysis of cascades on net- 
works. We define a cascade generating function that cap- 
tures the details of the microscopic dynamics of the cas- 
cades. We show that this function can also be used to com- 
pute the macroscopic properties of cascades, such as their 
size, spread, diameter, number of paths, and average path 
length. We present an algorithm to efficiently compute the cascade
generating function and demonstrate that, while significantly
compressing the information within a cascade, it nevertheless
allows us to accurately reconstruct its structure.
We use this framework to study information dynamics on 
the social network of Digg. Digg allows users to post and 
vote on stories, and easily see the stories that friends have 
voted on. As a story spreads on Digg through voting, it 
generates cascades. We extract cascades of more than 3,500 
Digg stories and calculate their macroscopic and microscopic 
properties. We identify several trends in cascade dynamics: 
spreading via chaining, branching and community. We dis- 
cuss how these affect the spread of the story through the 
Digg social network. Our computational framework is gen- 
eral and offers a practical solution to quantitative analysis 
of the microscopic structure of even very large cascades. 
1. INTRODUCTION

Throughout history, the flow of ideas and innovation has 
led to vast cultural, economic, and political changes. Social 
scientists have studied this phenomenon in detail in several 
different settings [15] and found that ideas and innovations
tend to diffuse along social links. First, an innovator adopts 
a novel idea or practice, then people in contact with the in- 
novator adopt it, then people in contact with those people, 
and so on. In this way, information cascades on a social network.
Not surprisingly, information cascades are also common in online
social networks. They are created, for example, when an individual
forwards an email she receives to her contacts [16, 17] or retweets
a news item to her followers on Twitter [18]. How information
spreads in online social networks may be indicative of its
quality [19, 20].
A mathematical tool for analysis of cascades can find exten- 
sive use anywhere where cascades are studied: anomaly and 
spam detection, information classification, viral marketing, 
epidemiological studies, computer virus spread, political and 
social unrest and even power transmission failure [8, 9].

Availability of large scale data about human behavior in 
online social networks has enabled computational scientists 
to investigate what drives information diffusion and sug- 
gest mechanisms to facilitate its spread. However, as in any 
other field of research, there are two distinct ways of tack- 
ling this problem: model-centric or empirical. Model-centric 
approaches make certain assumptions about how individu- 
als participating in a cascade are affected by their neigh- 
bors (independent cascade or threshold model). Using these 
models, researchers have tried to infer global properties of 
information cascades in social networks [23, 9], devise efficient
methods to infer the underlying network structure [13, 11] or
maximize cascade size [22, 21], and identify influential
spreaders [14]. However, empirical approaches are needed to
validate assumptions made by these models. We need prin- 
cipled mathematical tools to quantitatively characterize the 
temporal and spatial properties of cascades as they occur in 
real-world networks. However, to the best of our knowledge,
no previous work has attempted to quantify the dynamics 
of information cascades on social networks or characterize 
their microscopic growth. At most, researchers have visualized
the shape of cascades [17] or enumerated their commonly
observed patterns [1]. Such approaches do not scale
to even moderately large cascades. 

To address this gap, we propose a practical, general, and 
scalable quantitative framework for the analysis of cascades 
on social networks that is applicable even to large cascades. 
We define a cascade generating function, which captures the 
details of the dynamics of information diffusion on networks. 
We can use this function to (1) compute the macroscopic 
properties of the cascade, such as its size, diameter, average 
path length, etc., (2) reconstruct the shape of the cascade,
and (3) analyze its microscopic dynamic properties. The
cascade generating function is a good signature [5] of the 
contagion process occurring on a network. It could help us 
identify patterns, trends, and anomalies within the cascades 
in near real-time. It could aid spam filtering, since the flow 
of spam messages within a network will be different from the 
flow of valid information. It could be useful for viral mar- 
keting, since it can help us discover the signature of trends 
that become popular as compared to those which do not. 

As the size of cascades grows, storing their complete struc- 
ture may not be feasible. However, the cascade generating 
function can approximate the structure of the cascade with 
very high accuracy, in spite of having pseudo-linear space 
complexity. Hence, the cascade generating function can pro- 
vide efficient compression of the information in a cascade. 

This paper makes the following contributions. In Section 2 we
describe a general mathematical framework for representing and
quantitatively analyzing cascades on social networks. Specifically,
in Section 2.1 we define the cascade generating function, which
describes how information spreads through the network. We show
that this function can be used to compute a cascade's macroscopic
properties, such as its size, diameter, and number of paths. In
Section 2.2 we present a fast, efficient algorithm to compute this
function, with O(dKN) runtime complexity and O(KN) space
complexity in its naive implementation, where N is the number of
nodes participating in a cascade, d is the maximum degree of any
node, and K is the number of independent cascade seeds. We
demonstrate the use of the cascade generating function to study
the dynamics of cascades in Section 2.3, where we illustrate the
framework on simple cascades often observed in online social
networks. In Section 3 we apply it to study large information
cascades occurring on the real-world social network of Digg
(http://digg.com). This site allows people to submit and vote for
news stories, and also to create links to other people in order to
see what new stories they have recently voted for. Stories
propagate on Digg's social network through a series of cascades as
users influence their fans to vote for the story [18, 12]. We study
the distribution of several macroscopic properties of these
cascades. In addition, we study the microscopic dynamics of their
temporal evolution. Time plots of the cascade generating function
show several characteristic signatures of cascade growth, such as
star-like, chain-like, and community-like growth.



2. A FRAMEWORK FOR ANALYZING CASCADES

Consider a social network, represented by a graph $G(V, E)$ with
$|V| = N$ nodes and $|E| = M$ directed edges. If node a wants to
watch the activities of node b, she must create an edge to b by
designating b as a friend. We call a a fan (or follower) of b.
Figure 1(a) shows a directed network in which node 4 is a fan of
nodes 1 and 2. We call an edge $e_{ij}$ active if node j is a fan of
node i and node i is activated before node j. Information or
influence flows from activated nodes to their fans. In Figure 1,
information flows from nodes 1 and 2 to node 4.

Figure 1: A toy example of an information cascade on a network.
Nodes are labeled in the temporal order in which they are activated
by the cascade. The nodes that are never activated are blank.
(a) The edges show the underlying friendship network. Edge
direction shows the semantics of the connection, i.e., nodes are
watching others to which they are pointing. (b) Two cascades on
the network (shown in yellow and red). Node 1 is the seed of the
first (yellow) cascade and node 2 is the seed of the second (red)
cascade. Node 4 belongs to both cascades and is shown in orange.

A cascade is a sequence of activations generated by a contagion
process, in which nodes cause connected nodes to be activated with
some probability. In analogy with the spread of an infectious
disease on a network, an infected (activated) node exposes his
fans to the infection. Disease
cascades through the network as exposed fans become in- 
fected, thereby exposing their own fans to the disease, and 
so on. The seed of a cascade is the node that initiates the 
cascade. In information cascades the seed is an indepen- 
dent originator of information, who then influences others 
to adopt, endorse, or transmit that information. We call a 
node that participates in a cascade a member of the cas- 
cade. A contagion process can generate multiple cascades, 
and a node can participate in more than one cascade, resulting
in a commonly observed "collision of cascades" [1] phenomenon.
Figure 1(b) shows cascades on the network shown in Fig. 1(a), in
which nodes are labeled in the order they are activated, with
links showing the direction of
influence. As shown, the contagion process generates two 
cascades whose seeds are nodes 1 and 2, respectively. Node 
4 participates in both cascades. 

A cascade chain is a sequence of connected nodes partici- 
pating in a cascade. Each node in the cascade chain is influ- 
enced by all the nodes in the chain activated before it and 
influences all the successive nodes in the chain. The length
of the longest chain is the diameter of the cascade [1]. The
spread of the cascade is the maximal branching number of
its participants, i.e., the maximum number of nodes a single
member infects. The diameter of the contagion process in
Fig. 1(b) is two (the longest chain is 1 → 3 → 6); the spread of
cascade 1 (yellow) is 4 and that of cascade 2 (red) is 2.
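
The two definitions above can be checked mechanically. The following is a minimal Python sketch, assuming the active edges of Fig. 1(b) are the ones that can be read off from the worked examples in the text (1 → 3, 1 → 4, 1 → 6, 1 → 7, 2 → 4, 2 → 5, 3 → 6). The names `EDGES`, `diameter`, `members`, and `spread` are illustrative and do not come from the paper.

```python
from collections import defaultdict

# Active edges (influencer -> fan) of the contagion process in Fig. 1(b),
# as read off from the examples in the text (an assumption of this sketch).
EDGES = [(1, 3), (1, 4), (1, 6), (1, 7), (2, 4), (2, 5), (3, 6)]


def diameter(edges):
    """Length (in edges) of the longest chain.

    Relies on nodes being labeled in activation order, so every edge
    goes from a smaller label to a larger one.
    """
    longest = defaultdict(int)  # longest chain ending at each node
    for src, dst in sorted(edges, key=lambda e: e[1]):
        longest[dst] = max(longest[dst], longest[src] + 1)
    return max(longest.values(), default=0)


def members(edges, seed):
    """All nodes reachable from `seed` along active edges."""
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)
    seen, stack = {seed}, [seed]
    while stack:
        for nxt in children[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def spread(edges, seed):
    """Maximal number of nodes infected by a single member of the cascade."""
    out_degree = defaultdict(int)
    for src, _ in edges:
        out_degree[src] += 1
    return max(out_degree[n] for n in members(edges, seed))
```

Under these assumptions, `diameter(EDGES)` gives 2, while `spread(EDGES, 1)` and `spread(EDGES, 2)` give 4 and 2, matching the values quoted for Fig. 1(b).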

2.1 Characterizing Cascades 

We characterize a cascade mathematically by the cascade
generating function, $\phi(j, \alpha_{j,i})$, which describes how
activation spreads through the network. The contagion process is
parameterized by the transmission rates
$\alpha_{j,i}\ \forall j, i \in [1, N]$, which give the probability
that a node i activated at time $t_i$ will activate a connected
node j at a later time $t_j$. Though, in principle, $\alpha_{j,i}$
could be different for different values of i and j, for simplicity
we assume that they are all the same, i.e., $\alpha_{j,i} = \alpha$.
Note that, since the nodes are labeled in the temporal order of
their activation, $\phi(j, \alpha_{j,i})$ characterizes



We use the contagion process shown in Fig. 1(b) to illustrate how
the cascade generating function is calculated. The initial value
of the cascade function is some constant. In the example, nodes 1
and 2 are seeds; therefore, the values of the cascade function at
the times they are activated are constant. While these values may
be different, for convenience we set them both to one:
$\phi(1, \alpha) = \phi(2, \alpha) = 1$. The value of
$\phi(j, \alpha)$ captures the cumulative effect on node j of
activated nodes that are connected to j. Node 3 is connected to 1
and activated by it with probability $\alpha$; therefore,
$\phi(3, \alpha) = \alpha\phi(1, \alpha)$. At the time node 4 is
activated, the cascade function is
$\phi(4, \alpha) = \alpha\phi(1, \alpha) + \alpha\phi(2, \alpha)$.
Nodes continue to activate others in this fashion. At time $t_6$,
the cascade function is
$\phi(6, \alpha) = \alpha\phi(1, \alpha) + \alpha\phi(3, \alpha)$.
Since $\phi(3, \alpha)$ only depends on $\phi(1, \alpha)$,
$\phi(6, \alpha)$ can be rewritten as
$\phi(6, \alpha) = \alpha\phi(1, \alpha) + \alpha^2\phi(1, \alpha)$.

In general terms, if node i is a node activated at time $t_i$,
the value of the cascade generating function at later time $t_j$
when node j is activated is:

$$\phi(j, \alpha) = \sum_{i \in \mathrm{friend}(j)} \alpha\, \phi(i, \alpha) \qquad (1)$$

where friend(j) is the set of nodes connected to node j that
are activated before it. Since links are directed, without loss
of generality, we can assume that there are K cascades in
a contagion process. Let
$\phi(i_1, \alpha), \phi(i_2, \alpha), \ldots, \phi(i_K, \alpha)$ be
the weights of their seeds. Then, Eq. 1 reduces to

$$\phi(j, \alpha) = \sum_{p=1}^{K} f(j, i_p, \alpha)\, \phi(i_p, \alpha) \qquad (2)$$
The value of $\phi(j, \alpha)$ is proportional to the cumulative
effect or influence of all cascades on node j activated at time
$t_j$ and can be described using the vector
$(f(j, i_1, \alpha), \ldots, f(j, i_K, \alpha))$. Here
$f(j, i_p, \alpha)$ captures the cumulative effect of the cascade
generated at seed node $i_p$ on the node j, where $t_j > t_{i_p}$.
In Fig. 1(b), at time $t = 4$,
$\phi(4, \alpha) = f(4, 1, \alpha)\phi(1, \alpha) + f(4, 2, \alpha)\phi(2, \alpha)$,
where $f(4, 1, \alpha) = \alpha$ and $f(4, 2, \alpha) = \alpha$. At
time $t = 6$,
$\phi(6, \alpha) = f(6, 1, \alpha)\phi(1, \alpha) + f(6, 2, \alpha)\phi(2, \alpha)$.
Here $f(6, 1, \alpha) = \alpha + \alpha^2$ and $f(6, 2, \alpha) = 0$.
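
The recursion can be written down in a few lines. The following is a minimal Python sketch rather than the authors' implementation: it assumes unit seed weights, a single transmission rate `alpha`, and the friend lists of Fig. 1(b) as read off from the worked examples in the text. The names `cascade_vectors` and `FIG1B` are illustrative.

```python
def cascade_vectors(friends, alpha=0.5):
    """Return {node: {seed: f(node, seed, alpha)}}.

    `friends[j]` lists the nodes that j follows and that were activated
    before j; nodes are labeled in activation order. A node with no such
    friends is treated as a seed with weight 1.
    """
    f = {}
    for j in sorted(friends):
        if not friends[j]:
            f[j] = {j: 1.0}  # seed: phi = 1 by convention
            continue
        vec = {}
        for i in friends[j]:  # each earlier friend contributes alpha * phi(i)
            for seed, value in f[i].items():
                vec[seed] = vec.get(seed, 0.0) + alpha * value
        f[j] = vec
    return f


# Friend lists for Fig. 1(b), read off from the text (an assumption).
FIG1B = {1: [], 2: [], 3: [1], 4: [1, 2], 5: [2], 6: [1, 3], 7: [1]}

vectors = cascade_vectors(FIG1B, alpha=0.5)
phi = {j: sum(v.values()) for j, v in vectors.items()}  # phi with unit seed weights
```

With `alpha = 0.5`, node 4 gets the vector `{1: 0.5, 2: 0.5}` and node 6 gets `{1: 0.75}`, i.e. $\alpha + \alpha^2$ for the first cascade and no entry (zero) for the second.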

If the values of the cascade generating function for nodes
i and j are the same, $\phi(i, \alpha) = \phi(j, \alpha)$, the nodes
i and j are isomorphic with respect to the contagion process. Such
nodes are structurally similar with respect to the cascade;
therefore, the value of the cascade function is independent
of the order in which they are activated. By structural
similarity, we mean that in a network comprising only the
activated nodes and the active edges between them, the topological
distance of two isomorphic nodes from all the seeds is the
same. Here, the topological distance of a node from the
seed is measured in terms of the total number of attenuated
paths over active edges. Isomorphic nodes can be grouped
together in a tier with its own characteristic $\phi(\alpha)$. In
the contagion process in Fig. 1(b), nodes 3 and 7 are isomorphic
and form a tier with value $\phi(\alpha) = \alpha\phi(1, \alpha)$.

Cascade properties. We can use the cascade generating 
function to compute the macroscopic properties of cascades, 
such as their size, diameter, number of paths, and their av- 
erage length. 

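If we take $\phi(i_p, \alpha) = 1$, where $i_p$ is the seed of the
p-th cascade, activated at time $t_{i_p}$, then the total number of
paths from $i_p$ to node j is equal to $f(j, i_p, 1)$ in Eq. 2. The
total length of paths from the seed $i_p$ to j, $l(j, i_p)$, can be
obtained by differentiating $f(j, i_p, \alpha)$ with respect to
$\alpha$ and evaluating at $\alpha = 1$:

$$l(j, i_p) = \left. \frac{\partial f(j, i_p, \alpha)}{\partial \alpha} \right|_{\alpha = 1} \qquad (3)$$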
To illustrate this, consider again the contagion process shown
in Fig. 1(b). For example, if we pick node 6, there are two
paths from the seed (node 1) to node 6: 1 → 3 → 6 and
1 → 6. The total length of these paths is three. There are
no paths from the second seed (node 2) to node 6. We can
also get this answer from

$$\left. \frac{d\phi(6, \alpha)}{d\alpha} \right|_{\alpha = 1} = \left. \frac{d\,(\alpha + \alpha^2)\,\phi(1, \alpha)}{d\alpha} \right|_{\alpha = 1} = 3.$$
We can use similar reasoning to compute other cascade
properties. The average path length, $l_{av}$, is given by:

$$l_{av} = \frac{\sum_j \sum_{p=1}^{K} l(j, i_p)}{\sum_j \sum_{p=1}^{K} f(j, i_p, 1)} \qquad (4)$$
The diameter of the contagion process is the length of the
longest path of any cascade generated by this process. It is
given by modifying Eq. 4
2.2 Computational Framework for

Cascade Graph. 

For the analysis of the contagion process, we create a cascade
graph $G^c(V^c, E^c)$ from the original network $G(V, E)$
as follows. Let $|V^c|$ be the number of nodes participating
in all cascades. Let a cascade begin at time $t_1$ and end
at $t_N$. We arrange and label the nodes in the temporal
order in which they are activated (e.g., transmit information):
$1, 2, \ldots, N$, where node k is activated at time $t_k$ and
$t_1 < \cdots < t_k < \cdots < t_N$. An edge exists from j to i in
$G^c$ (i.e., i is activated by j) if an edge exists from i to j in G
(i is a fan of j) and $t_i > t_j$. The adjacency matrix $A^c$ of the
cascade graph $G^c(V^c, E^c)$, the cascade matrix, is:

$$A^c(i, j) = \begin{cases} 1 & \text{if } \exists \text{ an edge from } j \text{ to } i \text{ in } G^c(V^c, E^c) \text{ and } j < i \\ 0 & \text{otherwise} \end{cases}$$

We break ties randomly. If nodes a and b receive information
at the same time $t_k$, without loss of generality, we assume
$a = k$ and $b = k + 1$. Also, we modify the adjacency matrix
$A^c$, making $A^c(k+1, k) = 0$ and $A^c(k, k+1) = 0$, irrespective
of whether or not an edge exists between k and k + 1. This
means that neither node can activate the other, since they
are activated at the same time. We note that node 1 is always
the seed of a cascade. The cascade matrix can encode a
contagion process that generates multiple cascades.
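
The construction above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the input format (`watches`, `activation_time`) is hypothetical, indices are 0-based, and ties are broken by the stable sort order instead of randomly. Simultaneously activated nodes are never linked, which generalizes the tie rule above from two nodes to any number.

```python
def cascade_matrix(watches, activation_time):
    """Build the cascade matrix A^c as a list of rows.

    watches[u]         -- set of nodes that u follows (u is their fan)
    activation_time[u] -- time at which u was activated
    Returns (order, A): order[k] is the original id of the node with
    0-based label k, and A[i][j] == 1 iff i is a fan of j and j was
    activated strictly before i.
    """
    order = sorted(activation_time, key=lambda u: activation_time[u])
    n = len(order)
    A = [[0] * n for _ in range(n)]
    for i, u in enumerate(order):
        for j in range(i):
            v = order[j]
            # Nodes activated at the same time cannot activate each other.
            if v in watches.get(u, ()) and activation_time[v] < activation_time[u]:
                A[i][j] = 1
    return order, A


def seeds(A):
    """Seeds are the nodes whose row in A^c is all zeros."""
    return [i for i, row in enumerate(A) if not any(row)]


# Fig. 1(b), with fan links read off from the text (an assumption).
WATCHES = {3: {1}, 4: {1, 2}, 5: {2}, 6: {1, 3}, 7: {1}}
TIMES = {node: float(node) for node in range(1, 8)}
order, A = cascade_matrix(WATCHES, TIMES)
```

Here the row of node 6 is `[1, 0, 1, 0, 0, 0, 0]` (it follows nodes 1 and 3), and `seeds(A)` returns `[0, 1]`, the 0-based labels of nodes 1 and 2.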

Figure 2: Analysis of various cascades. Nodes are labeled by the
order in which they are activated by the contagion process. Row (a)
shows cascade plots obtained by computing the cascade generating
function, $\phi$, at different times. Row (b) shows the
corresponding contagion process. Different cascades within the same
contagion process are shown in different colors. Row (c) shows some
of the numeric properties of the cascades, and row (d) shows sets
of isomorphic nodes.

Contagion and Length Matrix.

In addition to the cascade matrix, we introduce the dynamic
adjacency matrix of the cascade graph, $A(t)$. This is
a time-dependent matrix, whose non-zero elements include
all nodes that have been activated up to time t:

$$A_{ij}(t_k) = \begin{cases} 1 & \text{if } A^c(i, j) = 1 \text{ and } t_i \le t_k \\ 0 & \text{otherwise} \end{cases}$$
The dynamic adjacency matrix allows us to compute connectivity
between nodes in a cascade, as measured by the number of paths
that exist between them. Following [25], let the attenuation
parameter $\alpha$ be the probability of transmitting a message or
influence along any edge from node i at time $t_i$ to node j at
time $t_j$. The contagion matrix over the time period
$[t_1, t_N]$ is:

$$C(\alpha) = \alpha^{N-1} A(t_N) \cdots A(t_3) A(t_2) + \cdots + \alpha^2 A(t_N) A(t_{N-1}) + \alpha A(t_N) + I \qquad (6)$$

The term $C_{ij}(\alpha)$ gives the number of attenuated paths from
node j to i in $G^c(V^c, E^c)$, and I is the identity matrix.

The total length of paths from one node to another can
be modeled using a formalism similar to the contagion matrix.
We define the length matrix as:

$$L(\alpha) = (N-1)\,\alpha^{N-1} A(t_N) \cdots A(t_3) A(t_2) + \cdots + 2\alpha^2 A(t_N) A(t_{N-1}) + \alpha A(t_N) + I \qquad (7)$$

Here $L_{ij}(1)$ gives the total length of paths from node j to
node i in $G^c(V^c, E^c)$. $L_{ij}(1)/C_{ij}(1)$ then gives the
average length of paths from node j to node i.
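
As a sanity check on Eq. 6 and Eq. 7, the sketch below evaluates both sums directly in plain Python. It rests on one assumption that should be stated explicitly: because the cascade matrix is strictly lower triangular, the time-indexed product of m dynamic adjacency matrices is taken to coincide with the m-th power of $A^c$. The matrix `A_FIG1B` encodes Fig. 1(b) as read off from the text, and all names are illustrative.

```python
def mat_mul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def contagion_and_length(A, alpha):
    """Evaluate Eq. 6 and Eq. 7 as sums over powers of the cascade matrix."""
    n = len(A)
    identity = [[float(i == j) for j in range(n)] for i in range(n)]
    C = [row[:] for row in identity]
    L = [row[:] for row in identity]
    power = [row[:] for row in identity]  # holds A^m
    for m in range(1, n):
        power = mat_mul(A, power)
        for i in range(n):
            for j in range(n):
                C[i][j] += alpha ** m * power[i][j]
                L[i][j] += m * alpha ** m * power[i][j]
    return C, L


# Cascade matrix of Fig. 1(b); row i, column j is 1 if node i+1 follows node j+1.
A_FIG1B = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0],
]

C1, L1 = contagion_and_length(A_FIG1B, 1.0)
```

For node 6 and seed 1 this gives `C1[5][0] == 2.0` paths with total length `L1[5][0] == 3.0`, hence an average path length of 1.5, in agreement with the worked example in Section 2.1.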

The first step towards quantifying cascades is seed identification.
This can be achieved by collecting all the maximal elements of
$G^c$, seen as a partially ordered set. Equivalently, if all the
elements of the i-th row of $A^c$ are zero, then node i is a seed of
the cascade. Finding all the seeds gives the total number of
cascades, K, in the contagion process. Let
$\phi(i_1, \alpha), \ldots, \phi(i_K, \alpha)$ be the cascade
function value of each seed. The value of the cascade function of
node j, which is activated after m (but before the remaining
$K - m$) cascade seeds, is
$\phi(j, \alpha) = \sum_{p=1}^{m} C_{j,i_p}(\alpha)\,\phi(i_p, \alpha)$.
A non-zero value of $C_{j,i_p}(\alpha)$ indicates that j is a member
of the cascade initiated by $i_p$. Hence,
$C_{j,i_p}(\alpha) = f(j, i_p, \alpha)$ in Eq. 2. The cascade
generating function is designed so that knowing only the columns of
the contagion matrix that correspond to the seeds is enough to
characterize the entire matrix. This, along with the triangular
nature of the cascade matrix, enables us to calculate the contagion
and length matrices, and hence the corresponding cascade generating
function, efficiently using dynamic programming (Algorithm 1). This
algorithm has O(KN) space and O(dKN) runtime complexity even in its
naive implementation, where d is the maximum degree of any node and
N is the number of nodes in the process.

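The contagion and length matrices together fully determine $\phi$
and $l$ and, therefore, capture the microscopic details of the
contagion process. If the vectors
$c_{j_1} = (C_{j_1,i_1}(\alpha), \ldots, C_{j_1,i_K}(\alpha))$ and
$c_{j_2} = (C_{j_2,i_1}(\alpha), \ldots, C_{j_2,i_K}(\alpha))$ are
equal, then $j_1$ and $j_2$ are isomorphic with respect to the
contagion process.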

The total number and total length of paths in the cascade
from seed $i_p$ to node j are given by
$C_{j,i_p}(1) = f(j, i_p, 1)$ and $L_{j,i_p}(1) = l(j, i_p)$ in Eq. 2
and Eq. 3. Hence, the total number of paths, total length, and
average path length for the entire contagion process are given by
$\sum_{i_p} \sum_{j \neq i_p} C_{j,i_p}(1)$,
$\sum_{i_p} \sum_{j \neq i_p} L_{j,i_p}(1)$, and the ratio of the
latter to the former, respectively.

Algorithm 1 Efficient algorithm for computing the Contagion and
the Length Matrix

Input
  $A^c$: adjacency matrix of the cascade graph
  $\alpha$: transmission probability

Output
  $C(\alpha)$: contagion matrix ($N \times K$); $L(\alpha)$: length matrix ($N \times K$)
  The p-th column of C and L corresponds to features of the cascade
  generated by the p-th seed.

$\forall p \in [1, K]$, j is the label of the p-th seed, activated at time $t_j$.
$C_{i,p}(\alpha)$ is the cascade generating value for node i with respect
to the p-th seed. $L_{i,p}(\alpha)$ at $\alpha = 1$ gives the total length of
paths from the p-th seed to node i.

$\forall p$: $C_{j,p}(\alpha) = 1$, $L_{j,p}(\alpha) = 1$
if $i < j$ then
    $C_{i,p}(\alpha) = 0$, $L_{i,p}(\alpha) = 0$
else
    if $i == j + 1$ then
        $C_{i,p}(\alpha) = \alpha A^c(i, j)$, $L_{i,p}(\alpha) = \alpha A^c(i, j)$
    else
        $C_{i,p}(\alpha) = \alpha A^c(i, j) + \sum_{k=1}^{i-j-1} \alpha A^c(i, i-k)\, C_{i-k,p}(\alpha)$
        $L_{i,p}(\alpha) = \alpha A^c(i, j) + \sum_{k=1}^{i-j-1} \alpha A^c(i, i-k)\, \big(C_{i-k,p}(\alpha) + L_{i-k,p}(\alpha)\big)$
        (both sums run over the edges $e(i-k, i)$, $k = 1, \ldots, i-j-1$)
    end if
end if
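
A direct Python transcription of Algorithm 1 may make the recursion easier to follow. It is a sketch under assumptions: indices are 0-based, the matrix is a dense list of rows, and seeds are detected as all-zero rows. With a dense matrix the inner loop scans every earlier node; replacing it with an adjacency list that holds at most d friends per node would be needed to reach the O(dKN) bound stated in the text. The names `algorithm1` and `A_FIG1B` are illustrative.

```python
def algorithm1(A, alpha):
    """Seed columns of the contagion matrix C and the length matrix L."""
    n = len(A)
    seeds = [i for i in range(n) if not any(A[i])]  # all-zero rows
    C = [[0.0] * len(seeds) for _ in range(n)]
    L = [[0.0] * len(seeds) for _ in range(n)]
    for p, j in enumerate(seeds):
        C[j][p] = L[j][p] = 1.0  # initial values for the seed itself
        for i in range(j + 1, n):  # nodes before the seed keep the value 0
            c = l = alpha * A[i][j]  # direct edge from the seed
            for q in range(j + 1, i):  # friends activated between j and i
                if A[i][q]:
                    c += alpha * C[q][p]
                    l += alpha * (C[q][p] + L[q][p])
            C[i][p], L[i][p] = c, l
    return seeds, C, L


# Cascade matrix of Fig. 1(b), read off from the text (an assumption).
A_FIG1B = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0],
]

seed_ids, C, L = algorithm1(A_FIG1B, alpha=1.0)
```

At $\alpha = 1$ the row for node 6 is `C[5] == [2.0, 0.0]` and `L[5] == [3.0, 0.0]`: two paths of total length three from the first seed and none from the second, as in the example of Section 2.1.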



As can be seen in Eq. 5, analogous to the length
matrix, we have devised an efficient algorithm to calculate
the diameter. Due to lack of space, we do not provide the
algorithm here. Since its formalism is very similar to that
of the length matrix, its computation has comparable runtime
and space complexity.

2.3 Analyzing Cascades 

Plotting the cascade generating function $\phi(j, \alpha)$ vs. time
(j) shows how the structure of the cascade evolves over time.
Fig. 2 illustrates the cascade plots computed for a variety
of contagion processes, which include several prototypes of
cascades frequently observed in recommendation and blog
networks [1, 5]. We label nodes in the order in which they
are activated and take $\phi(i, \alpha) = 1$ when i is the seed of
the cascade, thus giving equal weights to all cascades in the
contagion process. Without loss of generality, in this study
we set the value of $\alpha$ to 0.5. Future work includes
estimation of the transmission rate empirically from the network.
We show that cascade plots contain as much information as
cascade graphs, but can be used to analyze the structure
and evolution of even large cascades, for which visualization
is not feasible. In addition to showing the cascade plot
for each cascade (row (a)), Fig. 2 also reports some of the
macroscopic properties of the cascade (row (c)), such as the
total number of paths and their length, average path length,
and diameter. Note that this is not an exhaustive list of the
properties that can be calculated using the cascade generating
function. Row (d) lists groups of isomorphic
nodes in each cascade.

Cascades (1)-(3) in Fig. 2 are three of the commonly observed
patterns: a star (Fig. 2(1)), a chain (Fig. 2(2)), and a community
(clique) (Fig. 2(3)). In the star-like contagion process,
Fig. 2(1), nodes activated by the seed have the same value of
$\phi$ and form an isomorphic group. Interchanging the order of
their activation does not affect the value of $\phi$ or the cascade
plot. In the chain-like contagion process, the cascade function
decreases as the chain becomes longer. There are no isomorphic
nodes. In the clique-like contagion process, the value of the
cascade function grows in time as more paths are created in the
cascade. There are also no isomorphic nodes in this cascade.
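
These three behaviors follow directly from Eq. 1. As a worked illustration, assume idealized versions of the patterns (which may differ in detail from the panels of Fig. 2): a star in which every node follows only the seed, a chain in which node j follows only node j - 1, and a clique in which node j follows all earlier nodes, with $\phi(1, \alpha) = 1$ in each case.

```latex
\begin{align*}
\text{star:}   \quad & \phi(j, \alpha) = \alpha\,\phi(1, \alpha) = \alpha,
                     & j \ge 2, \\
\text{chain:}  \quad & \phi(j, \alpha) = \alpha\,\phi(j-1, \alpha) = \alpha^{\,j-1},
                     & j \ge 2, \\
\text{clique:} \quad & \phi(j, \alpha) = \alpha \sum_{i<j} \phi(i, \alpha)
                       = \alpha\,(1 + \alpha)^{\,j-2},
                     & j \ge 2.
\end{align*}
```

The clique result uses the partial sums $S_j = \sum_{i \le j} \phi(i, \alpha)$, which satisfy $S_j = (1 + \alpha) S_{j-1}$ with $S_1 = 1$. For $\alpha = 0.5$ the star stays flat at 0.5, the chain decays as 0.5, 0.25, 0.125, and the clique grows as 0.5, 0.75, 1.125, which is consistent with the qualitative description above.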

In the contagion process in Fig. 2(4), nodes activated at
$t = 2$ and $t = 3$ are isomorphic; therefore, the evolution
of this cascade is indistinguishable from the cascade shown
in Fig. 2(5). However, if the shape of the cascade is the
same but nodes are activated in a different order, as in
Fig. 2(6), the cascade plot and its structure are different. This
is because in contagion processes (4) and (5) the cascade widens
first (it is star-like) before lengthening, while in contagion
process (6) the cascade lengthens first (it is chain-like)
before widening. Similarly, cascade (7) first deepens,
then widens, the opposite of cascade (8), while cascade (9)
alternates between deepening and widening. In none of these
cascades (except (3)) are there multiple paths to a node.
Once this happens, as in cascades (10) and (11), the value
of the cascade function increases.

We can also disentangle multiple cascades co-occurring in 
a contagion process. Contagion processes (12)-(14) contain 
multiple cascades, whose cascade functions are shown in different
colors. Note that in the contagion process (12), node 4
is isomorphic to 3 and 7 with respect to the cascade initiated 
by 1, and it is isomorphic to 5 with respect to the cascade 
initiated by 2. 

2.4 Reconstructing Cascades 

Given the contagion matrix, it is possible to reconstruct
the contagion process with a high level of accuracy. The
cascade generating function $\phi$ compresses information and
has a space complexity of O(KN), where K is the number
of seeds in the contagion process. How well does the
compressed representation capture the contagion process?

Using $\phi$ with $0 < \alpha < 1$, a tier-level reconstruction of
the contagion process is possible. This reconstruction does not
remove the degeneracy of isomorphic nodes. The temporal ordering
of the nodes helps us to fine-tune the tier-level reconstruction.
Additional information, such as the number of nodes and their
indegree and outdegree, can help us further improve the
approximation. In all the examples shown in Fig. 2, using just
$\phi$, we are able to obtain the exact tier-level reconstruction.
To illustrate, consider Fig. 1(b). Taking $\phi(1) = \phi(2) = 1$,
we get $\phi(3, \alpha) = \phi(7, \alpha) = (\alpha, 0)$, where
$\alpha$ is the value of the cascade function for the cascade
initiated by seed 1, and 0 is the value of the cascade function
for the cascade initiated by the second seed, node 2. Likewise,
$\phi(4, \alpha) = (\alpha, \alpha)$, $\phi(5, \alpha) = (0, \alpha)$
and $\phi(6, \alpha) = (\alpha + \alpha^2, 0)$.
Hence we can reconstruct that nodes 1 and 2 are independent
seeds, 3 and 7 are connected to only node 1, and 5 is
connected to only node 2. Node 4 is connected to both 1
and 2. Node 6 is connected to 1 and to the tier of nodes
containing nodes 3 and 7. However, due to the temporal
arrangement of the nodes, we know that 6 is activated before
7; hence it is necessarily connected to node 3. Thus we are
able to obtain the exact reconstruction of the cascade.

In Fig. 2, using just $\phi$, we are able to obtain the exact
tier-level reconstruction for cases 1, 4, 5, 7, 8, 9, 10, 11 and
13. In most of these cases we are also able to disambiguate
between isomorphic nodes in the same tier. For cases 2, 3,
6, 12, and 14, we are also able to obtain exact node-level
reconstruction of the cascade graph.

Space and time complexity. 

Clearly, as demonstrated by the discussion above, knowing
the values of $\phi$ at different times allows us to deduce the
dynamics of a cascade and reconstruct its structure (up to the
degeneracy that exists for isomorphic nodes). Storing the
shape of the cascade has $O(N^2)$ space complexity. However,
as demonstrated above, the cascade generating function can
reconstruct this shape with a high degree of accuracy. Having
a pseudo-linear space complexity of O(KN), it provides an
efficient compression of this information. Moreover, the model is
general: the same model can be used to investigate
cascades in information flow, epidemics, computer viruses,
and so on. The method is fast, with O(dKN) runtime
complexity even in its naive implementation, where d is the
maximum degree of any node. Finally, the cascade generating
function of a node activated at time t depends only on
the cascade values of his friends activated before him. Hence,
it can be calculated in real time and is appropriate even for
applications that require streaming, online, or near real-time
analysis of cascades.

3. DIGG CASE STUDY 

We use the framework described above to study informa- 
tion spread on the social news aggregator Digg which allows 
users to post and vote for news stories. Digg users can also 
create social networks by adding others as friends. Digg 
highlights the stories a user's friends posted or voted for by 
marking them with a green ribbon and also displaying them 
on the Friends Interface, a special page for watching friends' 
activity. A fan may then see the story, and if she decides to 
vote for it, the story then becomes visible to her own fans, 
who may in turn vote for it, etc. By voting for a story, a 
user may influence her fans to also vote for it [12]. The spread
of a story through the social network of Digg is a conta- 
gion process that generates many cascades. The submitter 
is the seed of a cascade. However, there can be other means 
through which the story can reach a user. For instance, the 
user could independently find it on one of Digg's web pages 
or through a link from an external site. If a user finds the
story through means other than the Friends Interface, she
becomes an independent seed for another cascade. Not all
seeds, however, generate non-trivial cascades. If a voter is
unconnected or does not influence at least one of her fans to 
vote, the story does not spread. An independent user who 
generates a non-trivial cascade is its active seed. 
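
The seed definitions above translate into a simple pass over a story's votes. The following Python sketch is an approximation under stated assumptions: the data cannot show how a user actually found a story, so a voter is treated as a seed when none of the users she watches voted before her, and as an active seed when at least one later voter watches her. The function name and the input format are hypothetical and do not reflect the Digg API.

```python
def classify_voters(votes, friends):
    """Split the voters of one story into seeds and active seeds.

    votes   -- list of (user, vote_time), at most one vote per user
    friends -- friends[u] is the set of users that u watches
    """
    ordered = sorted(votes, key=lambda v: v[1])
    voted_at = {}
    influenced_by = {}
    for user, t in ordered:
        # Friends of `user` who voted strictly earlier could have exposed her.
        influenced_by[user] = {
            v for v in friends.get(user, ()) if v in voted_at and voted_at[v] < t
        }
        voted_at[user] = t
    seeds = [u for u, _ in ordered if not influenced_by[u]]
    influencers = set().union(*influenced_by.values())
    active_seeds = [u for u in seeds if u in influencers]
    return seeds, active_seeds


# Toy story: bob watches ann, dan watches someone who never voted.
votes = [("ann", 0), ("bob", 1), ("cat", 2), ("dan", 3)]
friends = {"bob": {"ann"}, "dan": {"zed"}}
story_seeds, story_active_seeds = classify_voters(votes, friends)
```

In this toy example ann, cat, and dan are seeds, but only ann is an active seed, because only she has a fan (bob) who voted after her.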

We used the Digg API to collect data about 3,553 stories promoted
to the front page in June 2009. The data associated
with each story contains its title, id, link, submitter's name,
submission time, list of voters and the time of each vote,
and the time the story was promoted to the front page. In
addition, we collected the list of voters' friends.¹ We define
an active user as a person who votes for at least one story.
In our data set there are 139,410 active users. Next, we get
the connections between the active users. We say user a is
connected to user b if she is either a friend or a fan of user
b. We store an active user in the active users network if she is
connected to one or more active users. Hence, we are able to
determine whether a user who votes for a particular story
is a fan of any previous voter. Out of all active users, 69,524
users are connected to one or more active users. These 69,524
connected users form the underlying friendship network. Of
these, 57,908 users form a giant connected component. 572
of the 587 distinct submitters belong to this friendship network.

¹ This data is available for research purposes at
http://www.isi.edu/~lerman/downloads/digg2009.html

We treat each story as an independent contagion process. 
We arrange all voters in the temporal order in which they 
voted for the story and extract the underlying social network 
of these voters. Let $n_s$ be the number of active seeds of the
contagion process of a story s. We take each active seed to be
independent of other seeds. Therefore, we can quantitatively
characterize the cascade by an $n_s \times 1$ vector $c_j$ for every
node j participating in the contagion process. The transmission
rate $\alpha$ can be derived empirically from the network. In this
work, without loss of generality, we set the value of $\alpha$ to
0.5. We
use the framework described above to study the macroscopic 
properties of cascades on Digg, such as the distribution of 
cascade size, diameter, etc. We also study the dynamics of 
evolution of cascades associated with some sample stories. 

3.1 Macroscopic Cascade Characteristics 

The stories in our data set generated 216,088 distinct
information cascades on the Digg social network. Using the
formalism described above, we calculate global properties of
these cascades and plot their distribution. These properties
include cascade size, spread, diameter, etc. Due to lack of
space, we have included in this paper just some examples of
the many properties that we can calculate using $\phi$.

To fit continuous distributions to discrete data, we treat
a discrete distribution as if it were generated from a continuous
probability density function and then rounded to the
nearest integer. We do not use common methods
such as least-squares minimization, because the data
span many orders of magnitude and least-squares minimization
can produce substantially inaccurate parameter estimates
for heavy-tailed distributions like the power law El.
We use Maximum Likelihood Estimation (MLE)
to estimate the parameters of these distributions
and the Kolmogorov-Smirnov (KS) statistic to test the goodness of fit. The closer the
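KS statistic is to 0, the better the fit. We study the following
distributions: lognormal,
$F(x;\mu,\sigma) = \frac{1}{2}\,\mathrm{erfc}\left[-\frac{\ln x - \mu}{\sigma\sqrt{2}}\right]$;
Weibull,
$F(x;k,\lambda,\eta) = 1 - e^{-\left(\frac{x-\eta}{\lambda}\right)^{k}}$;
mixed Weibull,
$F(x;\alpha_i,k_i,\lambda_i) = \sum_{i=1}^{r} \alpha_i \left(1 - e^{-\left(\frac{x}{\lambda_i}\right)^{k_i}}\right)$
with $\sum_{i=1}^{r} \alpha_i = 1$; and power-law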
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More often power law applies only for values greater than a 
certain minimum $x_{\min}$. In such cases, only the tail of the
distribution follows the power law. Using the MLE estimates of $x_{\min}$
and the scaling parameter $\alpha$, we find what percent of the data
comprises this tail of the distribution. We also investigated 
distribution fitting using the Double Pareto Lognormal dis-
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where $A(\theta,\mu,\sigma) = e^{\theta\mu + \theta^2\sigma^2/2}$. The Double Pareto Lognormal
(DPLN) distribution with $\alpha = 2.8$, $\beta = 1.9$, $\mu = 3.0941$,
$\sigma = 0.3119$ gives the best fit for the number of cascades
(better than any of the distributions shown in Table 1),
with a log-likelihood of $-15.234 \times 10^3$ and a KS statistic of 0.0109.
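
The power-law tail estimate mentioned above can be sketched as follows. This is a simplified illustration, not the fitting procedure used for Table 1: it uses the standard continuous maximum-likelihood estimator $\hat{\alpha} = 1 + n / \sum_i \ln(x_i / x_{\min})$ and assumes that $x_{\min}$ has already been chosen, whereas the analysis in this paper also estimates $x_{\min}$ from the data. The function name is hypothetical.

```python
import math


def power_law_tail_fit(data, x_min):
    """Estimate the power-law scaling parameter for the tail x >= x_min.

    Returns (alpha, tail_fraction), where alpha is the continuous MLE
    1 + n / sum(ln(x_i / x_min)) over the n tail points, and
    tail_fraction is the share of all observations lying in the tail.
    Assumption: x_min is supplied by the caller rather than estimated.
    """
    tail = [x for x in data if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if not tail or log_sum == 0.0:
        raise ValueError("tail is empty or degenerate for this x_min")
    alpha = 1.0 + len(tail) / log_sum
    return alpha, len(tail) / len(data)


alpha_hat, tail_fraction = power_law_tail_fit([1, 1, 2, 2, 4, 8], x_min=2)
```

The second return value corresponds to the percentage of the data that makes up the power-law tail.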



[Figure 3 plots; recoverable panel labels: num cascades, cascade size, ave. path, log(num paths)]



Figure 3: PDF of the distribution of cascade properties:
number of cascades per story, cascade size,
spread, diameter, average path length, and log of
the number of paths. Distributions are fitted with
the stretched exponential/Weibull (black), mixture
of Weibull (cyan), lognormal (red), and power-law
(green) functions. The Double Pareto Lognormal
distribution (magenta) gives a very good fit for the
number of cascades.

Fig. 3 shows the distribution of several macroscopic properties
of the information cascades on Digg, along with the functions
that best describe them. Table 1 shows the MLE estimates
of these distributions.
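
To make the fitting procedure concrete, the sketch below fits a lognormal by maximum likelihood and computes the KS statistic against the fitted CDF. It is a simplified illustration under stated assumptions: it ignores the rounding treatment for discrete data described above, handles only the lognormal case, and uses hypothetical function names.

```python
import math


def lognormal_cdf(x, mu, sigma):
    """Lognormal CDF written with the complementary error function."""
    return 0.5 * math.erfc(-(math.log(x) - mu) / (sigma * math.sqrt(2)))


def fit_lognormal_mle(data):
    """MLE for the lognormal: mean and (1/n) standard deviation of ln x."""
    logs = [math.log(x) for x in data]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def ks_statistic(data, cdf):
    """Largest gap between the empirical CDF and a model CDF."""
    xs = sorted(data)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        model = cdf(x)
        # Compare against the empirical CDF just before and at x.
        gap = max(gap, abs(model - i / n), abs((i + 1) / n - model))
    return gap


sample = [math.exp(v) for v in (-1.0, 0.0, 1.0)]
mu_hat, sigma_hat = fit_lognormal_mle(sample)
ks = ks_statistic(sample, lambda x: lognormal_cdf(x, mu_hat, sigma_hat))
```

A KS value closer to 0 indicates a better fit, which is the criterion used to compare the candidate distributions.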



Table 1: Parameter estimates for distributions that best describe data. 
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We observe that lognormal or stretched exponential gives 
a good fit with the observed distribution, and that power 
law mostly (if at all) accounts for a small percentage at the 
tail of the distribution. This indicates that a small number 
of core users may not be driving information propagation in 
online social networks on the whole. However, as the cas- 
cade size increases, some users may have disproportionate 
influence on information propagation. A lognormal distribution
indicates that the values might be generated by
the multiplicative effect of many i.i.d. random variables.
Following the Fisher-Tippett-Gnedenko theorem, the stretched
exponential distribution is a limit distribution of properly
normalized extrema of a sequence of i.i.d. random variables.
Hence the distribution may have been generated by the
extreme value of a set of i.i.d. random variables. The very
good fit of the DPLN distribution to the number of cascades
suggests a possible relationship between
the distribution of the number of cascades and geometric
Brownian motion. Future work includes delving deeper into the
probable causes of these distributions.
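
The multiplicative argument for the lognormal can be checked numerically. The sketch below is only an illustration of the general mechanism, not a model of Digg data: the choice of uniform factors on [0.5, 1.5], the sample sizes, and the function names are all assumptions. The logarithm of a product of i.i.d. positive factors is a sum of i.i.d. terms, so by the central limit theorem it should be close to normal, with skewness near zero.

```python
import math
import random


def log_products(n_factors, n_samples, seed=0):
    """Logs of products of i.i.d. positive factors (uniform on [0.5, 1.5])."""
    rng = random.Random(seed)
    return [
        sum(math.log(rng.uniform(0.5, 1.5)) for _ in range(n_factors))
        for _ in range(n_samples)
    ]


def skewness(values):
    """Sample skewness: third central moment over variance to the 3/2."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


logs = log_products(n_factors=200, n_samples=2000)
log_skew = skewness(logs)
```

A near-zero `log_skew` is consistent with the products themselves being approximately lognormal.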

3.2 Microscopic Cascade Characteristics 

The cascade generating function $\phi$ is an effective tool not
only for computing the global properties of cascades, but
also for analyzing their microscopic dynamic signatures. In
previous works, this was done by visualizing individual
cascades [1] or by creating a generative model of the contagion
process. Visualization, however, quickly becomes difficult,
even for moderately sized cascades. Generative models
are ad hoc in nature, and while they are designed to produce
cascades with macroscopic properties similar to those of the observed
cascades, they are not guaranteed to reproduce their
microscopic characteristics. The cascade generating function, on
the other hand, allows us to study microscopic properties of
even very large cascades without the need to visualize them.

We illustrate the use of cascade plots to study microscopic
dynamics of cascades with four different stories. Story 1,
titled "'Infomercial King' Billy Mays Dead at 50," was
submitted by a user who had 760 fans. This story was among
the most popular in our data set, receiving 8,471 votes, of
which 1,244 were from fans. The contagion process of this
story generated 853 cascades. Its diameter was 46, spread
412, and the average path length 24. Fig. 4(a) shows the evolution
of the cascade function $\phi(t)$ of the top three cascades,
ranked by their largest $\phi$ value. The left-hand set of plots
shows the early dynamics of the cascade ($t < 100$), while
the right-hand set of plots shows cascade dynamics over the
entire time period. The seed of the cascade is shown in red.

The top cascade attains its largest value of $\phi = 3.554 \times 10^7$.
This cascade started early in the contagion process. Though
the seeds of the next two cascades were also activated within
the first 100 votes, these cascades did not start growing until
later. Values of $\phi > 1$ imply that the voter is a fan of
two or more previous voters. Large values of $\phi$ in Fig. 4(a)
indicate a community effect (cf. Fig. 2(3)). This implies that
information is spreading within an interconnected fan network.
Though initially the three cascades of Story 1 are very
different, in their later stages they become increasingly similar.
This is due to mixing caused by "collision of cascades,"
which happens when the same nodes participate in different
cascades.

The popularity of Story 2, titled "Bender's back," is comparable
to the popularity of Story 1. Story 2 received 8,034
votes, of which 1,464 were from fans, and generated 722 cascades.
Its diameter was 26, spread 401, and the average
path length 12. Fig. 4(b) shows both the early and late-stage
dynamics of the top three cascades generated by this
story. However, the largest value of $\phi$ attained by any cascade
was just $\phi = 1859.4$, four orders of magnitude smaller
than for Story 1. This indicates a much lower connectivity
of the underlying fan network. In the three dominant cascades
of this story, $\phi$ does not rise above 2.5 during the first
100 votes. Low values of $\phi$ in the initial stages of cascade
evolution imply a chaining effect (cf. Fig. 2(1)), or cascade
growth by deepening. Unlike Story 1, here the seed of the
dominant cascade is the 19th voter. However, as seen from
the larger values of $\phi$ in the cascade plots, in the later stages
of information spread the community effect also comes into
play.

The third story, shown in Fig. 4(c), is titled "Play Doctor On
Yourself: 16 Things To Do Between Checkups." While this
story was submitted by a well-connected user (with 1,701
fans), it did not become popular, receiving only 390 votes, of
which 158 were from fans. This story generated 11 cascades,
and its diameter was 48, spread 5, and the average path
length 25. All of the first 100 voters participated in the
dominant cascade, the one initiated by the submitter himself.
The maximum value reached by this cascade was very
high ($\phi = 7.53 \times 10^7$), even though this cascade was of short
duration. Unlike in previous stories, we observed very high
values of $\phi$ already within the first 100 votes, which indicates
a strong community effect and high connectivity within the
fan network.

For the final illustration we consider the story titled "APOD:
2009 July 1 - Three Galaxies in Draco," shown in Fig. 4(d).
The submitter of this story has only 27 fans. This story is
one of the least popular in our data set, receiving only 199
votes, of which 27 were from fans. This contagion process
generated eight cascades; its diameter was 7, spread 7, and
the average path length 2.6. In the early stages, constant
values of $\phi$ in the dominant cascade (top plot in Fig. 4(d))
indicate a branching effect (cf. Fig. 2(1)). This implies that
the cascade is growing in a star-like fashion, rather than deepening.
The decreasing values of $\phi$ of the third cascade (bottom
plot in Fig. 4(d)) indicate a chaining effect, implying that
this cascade is deepening. We do not observe the community
effect in either the initial or the later stages of this contagion
process. The maximum value of $\phi$ for this story is two and
the average path length is 2.6, indicating that most of the
voters are fans of the submitter or of the submitter's fans, but
are not themselves interconnected.

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Figure 4: Cascade plots for the top three cascades of four stories. The left set of plots in each figure shows
cascade evolution in the early stages of the contagion process, while the right set of plots shows cascade
evolution over the entire time period. The red dot shows the time when the cascade seed was activated.

In summary, cascade plots can tell us much about the
microscopic evolution of an information cascade. Popular stories
that have large participation also generated many cascades
and had high spread. Initially they showed chaining
and branching effects, as evidenced by $\phi$ values that are
decreasing or staying constant in time, respectively. The
community effect, manifested by growing values of $\phi$, is visible
in later stages when a story penetrates and then spreads
through a community. The trends of the dominant cascades
grow increasingly similar with time due to the mixing effect
of "colliding cascades." However, stories that do not become
popular generate very few cascades and have low spread.
When the submitter is well connected, the community effect is
visible in all stages of the contagion process, implying that
the story spreads within the submitter's community only.
However, when the submitter is poorly connected, cascades grow by
chaining and branching.
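
The reading of cascade plots summarized above can be expressed as a rough rule of thumb. The sketch below is a hypothetical heuristic, not part of the framework itself: it labels a window of consecutive $\phi$ values by its net change only, and the function name and tolerance are assumptions.

```python
def classify_trend(phi_values, tol=1e-9):
    """Label a window of consecutive phi values by its net change.

    Decreasing phi suggests chaining, constant phi suggests branching,
    and growing phi suggests a community effect. This compares only the
    first and last values, so it is a coarse heuristic.
    """
    if len(phi_values) < 2:
        raise ValueError("need at least two phi values")
    change = phi_values[-1] - phi_values[0]
    if change > tol:
        return "community"
    if change < -tol:
        return "chaining"
    return "branching"


labels = [
    classify_trend([0.5, 0.25, 0.125]),
    classify_trend([0.5, 0.5, 0.5]),
    classify_trend([0.5, 1.5, 4.0]),
]
```

Applying such a rule to successive windows of a cascade would give a coarse textual summary of the stages that the cascade plots show graphically.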

4. RELATED WORK 

Most of the earlier work does not clearly distinguish between
cascades and the contagion processes generating these
cascades. We believe that ours is the first work to study
large-scale cascades without link ambiguity. Though large-scale
studies of information cascades have been carried out
earlier [12], the cascades were in general small ($O(10)$).
We, on the other hand, have very large cascades (extending
up to $O(10^4)$). The quantitative framework for analyzing
cascades that we present here is highly scalable and can
readily provide an efficient compressed representation
of large cascades, when storing the complete information of
the entire cascade is no longer trivial. In previous studies
[1] [2], the authors characterize the cascades/contagion
processes using a multilevel approach comprising global
and local signatures. Cascades are considered to be approximately
isomorphic if they have the same global signature.
If the global signatures match, more expensive isomorphism
tests based on local signatures are carried out. To aid
reasoning about cascades, the authors focus on local cascades,
which they define as the 'cascade in the (undirected)
neighborhood of the node', which for every node is the subgraph
induced on the nodes reachable from it. They enumerate the
shapes of these local cascades. As cascades grow in size, the
number of possible shapes increases exponentially and such
enumeration becomes infeasible. Note that in this work we
provide a scalable, efficient, and compressed representation
of the observed cascades and make no claims about whether
the observed cascades are the actual cascades or fragments
of them.

In our study of Digg, we have cascades of size up to
~20,000. We aim to deduce their qualitative as well as
quantitative properties, such as shape and size. Hence a
more formal framework for characterizing cascades is required.
In this paper we provide such a formalism, which
captures not only the macro- and micro-level signatures
described above, but much more. For instance, it captures the
similarity between cascades that were initially similar but
later became dissimilar, or between cascades that
are similar at some stage of their growth. This formalism
enables us to distinguish between cascades and obviates the
need for enumerating them or drawing their shape.

In [13] [11], the underlying network on which information
spreads is not observed, but has to be inferred from the
observed cascades. However, such inferences [13] are based on
the hypothesis that the contagion process follows an
independent cascade model [21]. Our work, on the other hand,
focuses on providing a quantitative tool to analyze the trends
and patterns of actual contagion processes observed on
real-life networks. Even when the underlying network is
predicted using a different inference method, e.g., [13, 11], the
trends of the contagion process occurring on the network
can be investigated using the cascade generating function.
Future work includes using these tools to aid the verification
or rejection of the hypotheses used for modeling information
spread [3] p] ^]. It can also prove to be an effective tool to
evaluate the robustness of inferred networks [13].

As demonstrated by the third story in our examples, we
observe that if the submitter is well connected, the community
effect is visible at all stages. However, initial popularity
only within the tightly knit community (shown by a
high cascade value and few seeds in the initial stages) does
not ensure global popularity (a large number of votes). In
contrast, stories submitted by a less well-connected user,
which initially spread by branching and deepening (with
low cascade values) but have a larger number of initial active
seeds, become more popular globally (as shown by the second
story in the examples). This observation is in agreement
with those reported in [20] [6] that content diffusing primarily
through an interconnected community tends to be confined
to that community. These cascades are also complicated by
the interplay between social influence and homophily [7] [10].
Future work will address these questions more closely.

In previous works [1] [8], the cascade size was found to be
described well by the power-law distribution. However, we
observe that the power law only accounts for a small fraction of
cascades at the tail of the distribution. Rather, the entire
data set can be approximated well with stretched-exponential
(Weibull), lognormal, or Double Pareto Lognormal distributions,
similar to those observed in [5].

5. CONCLUSION 

In this paper we adopt an empirical approach to study
cascades on networks. We believe that our work is the first
to provide a mathematical framework to quantify and analyze
cascades, even for applications requiring real-time or
online analysis. The mathematical framework is based on
the cascade generating function, which quantitatively characterizes
both the microscopic and the macroscopic properties of
the cascade. The macroscopic properties that can be efficiently
calculated using this tool include the diameter and
the spread of the cascades. This function also provides an
efficient compression of the information encoded in cascades.
In spite of having pseudo-linear space complexity, it can be
used to reconstruct the shape of the cascade with a high degree
of accuracy.

Although large-scale studies of cascades have been carried
out, the size of cascades in these studies was relatively small.
To the best of our knowledge, this is the first study of very
large cascades with thousands of participants. We use this
function to study information cascades on the online social
news aggregator Digg. For macroscopic properties such as the number
of cascades in a contagion process, cascade size, spread,
diameter, and average path length, we observe that a stretched
exponential (Weibull) or a lognormal distribution fits
the observed distribution well. The Double Pareto Lognormal
distribution gives a very good fit for the distribution of the number
of cascades. Usually the power law accounts (if at all) for
a small percentage of data in the tail of the distribution.
Microscopic analysis also revealed interesting insights into cascades
and contagion processes, such as the possible effect of
the initial number of seeds and of the branching, chaining,
and community effects on the initial popularity of news.
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